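PDF notes

**I. Rendering HTML pages to PDF**

I have used two approaches for rendering an HTML page to PDF:

* wkhtmltopdf
* chromedp

**1. Rendering PDFs with wkhtmltopdf**

wkhtmltopdf is a command-line tool that renders HTML pages to PDF using the Qt WebKit rendering engine.

Basic usage is simple:

`## Print a static HTML page to PDF`

`$ wkhtmltopdf input.html output.pdf`

`## Print a web page to PDF`

`$ wkhtmltopdf https://www.google.com output.pdf`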

wkhtmltopdf has a rich set of options. For example:

It can send HTTP POST requests, which is handy when rendering pages from your own web application to PDF:

`$ wkhtmltopdf --help`

`...`

`--post <name> <value>      Add an additional post field (repeatable)`

`...`

It can run JavaScript to modify the HTML before the PDF is rendered:

`$ wkhtmltopdf --run-script "javascript:(function(){document.getElementsByClassName('dom_class_name')[0].style.display = 'none'}())" page input.html output.pdf`

See the [official documentation](https://wkhtmltopdf.org/usage/wkhtmltopdf.txt) for the full list of options.

If you use Go, there is also a third-party package that wraps wkhtmltopdf: [go-wkhtmltopdf](https://github.com/SebastiaanKlippert/go-wkhtmltopdf)
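If pulling in a wrapper package feels like too much, the CLI can also be driven directly with `os/exec`. The following is only a minimal sketch: the helper names `wkhtmltopdfArgs` and `renderPDF` are made up for illustration, the binary is assumed to be on `PATH`, and the option order simply mirrors the commands shown above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// wkhtmltopdfArgs builds the argument list for one conversion.
// script is optional; when non-empty it is passed via --run-script,
// placed before the input as in the command shown earlier.
// (Hypothetical helper, not part of any library.)
func wkhtmltopdfArgs(input, output, script string) []string {
	var args []string
	if script != "" {
		args = append(args, "--run-script", script)
	}
	return append(args, input, output)
}

// renderPDF shells out to wkhtmltopdf and returns its combined
// output inside the error when the command fails.
func renderPDF(input, output, script string) error {
	cmd := exec.Command("wkhtmltopdf", wkhtmltopdfArgs(input, output, script)...)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("wkhtmltopdf failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Assumption: wkhtmltopdf is installed; otherwise do nothing.
	if _, err := exec.LookPath("wkhtmltopdf"); err != nil {
		fmt.Fprintln(os.Stderr, "wkhtmltopdf not found on PATH; skipping")
		return
	}
	if err := renderPDF("input.html", "output.pdf", ""); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Keeping the argument construction in its own function makes it easy to check what will be executed without actually running the tool.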

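**2. Rendering PDFs with chromedp**

[chromedp](https://github.com/chromedp/chromedp) is a Go package that offers a faster, simpler way to drive browsers that support the Chrome DevTools Protocol, without external dependencies such as Selenium or PhantomJS.

Usage: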

`package main`

`import (`

`  "context"`

`  "fmt"`

`  "os"`

`  "github.com/chromedp/cdproto/page"`

`  "github.com/chromedp/chromedp"`

`)`

`func main() {`

`  err := ChromedpPrintPdf("https://www.google.com", "/path/to/file.pdf")`

`  if err != nil {`

`    fmt.Println(err)`

`    return`

`  }`

`}`

`// ChromedpPrintPdf loads url in headless Chrome and writes the printed PDF to the path given in to.`

`func ChromedpPrintPdf(url string, to string) error {`

`  ctx, cancel := chromedp.NewContext(context.Background())`

`  defer cancel()`

`  var buf []byte`

`  err := chromedp.Run(ctx, chromedp.Tasks{`

`    chromedp.Navigate(url),`

`    // Wait until the body element is ready before printing.`

`    chromedp.WaitReady("body"),`

`    chromedp.ActionFunc(func(ctx context.Context) error {`

`      var err error`

`      // The second return value (a stream handle) is not needed here.`

`      buf, _, err = page.PrintToPDF().Do(ctx)`

`      return err`

`    }),`

`  })`

`  if err != nil {`

`    return fmt.Errorf("chromedp Run failed, err: %w", err)`

`  }`

`  if err := os.WriteFile(to, buf, 0644); err != nil {`

`    return fmt.Errorf("write to file failed, err: %w", err)`

`  }`

`  return nil`

`}`

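**II. Adding watermarks to PDFs**

The tools I know of that can watermark a PDF are:

* unidoc/unipdf
* pdfcpu

**1. unidoc/unipdf**

[unipdf](https://unidoc.io/unipdf/), developed by [unidoc](https://unidoc.io/), is a PDF library written in Go. It can be used through its API or its CLI, and supports the following features: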

`$ unipdf -h`

`...`

`Available Commands:`

`  decrypt    Decrypt PDF files`

`  encrypt    Encrypt PDF files`

`  explode    Explodes the input file into separate single page PDF files`

`  extract    Extract PDF resources`

`  form       PDF form operations`

`  grayscale  Convert PDF to grayscale`

`  help       Help about any command`

`  info       Output PDF information`

`  merge      Merge PDF files`

`  optimize   Optimize PDF files`

`  passwd     Change PDF passwords`

`  rotate     Rotate PDF file pages`

`  search     Search text in PDF files`

`  split      Split PDF files`

`  version    Output version information and exit`

`  watermark  Add watermark to PDF files`

`...`

Adding a watermark from the CLI:

`$ unipdf watermark in.pdf watermark.png -o out.pdf`

`Watermark successfully applied to in.pdf`

`Output file saved to out.pdf`

To add a watermark through the API, see the examples in the unipdf GitHub repository.

Note: unidoc products require a paid license.

**2. pdfcpu**

[pdfcpu](https://github.com/pdfcpu/pdfcpu) is a PDF processing library written in Go that can be used through its API or its CLI.

It supports the following features:

`$ pdfcpu help`

`...`

`The commands are:`

`  attachments  list, add, remove, extract embedded file attachments`

`  changeopw    change owner password`

`  changeupw    change user password`

`  decrypt      remove password protection`

`  encrypt      set password protection`

`  extract      extract images, fonts, content, pages, metadata`

`  fonts        install, list supported fonts`

`  grid         rearrange pages or images for enhanced browsing experience`

`  import       import/convert images to PDF`

`  info         print file info`

`  merge        concatenate 2 or more PDFs`

`  nup          rearrange pages or images for reduced number of pages`

`  optimize     optimize PDF by getting rid of redundant page resources`

`  pages        insert, remove selected pages`

`  paper        print list of supported paper sizes`

`  permissions  list, set user access permissions`

`  rotate       rotate pages`

`  split        split multi-page PDF into several PDFs according to split span`

`  stamp        add, remove, update text, image or PDF stamps for selected pages`

`  trim         create trimmed version of selected pages`

`  validate     validate PDF against PDF 32000-1:2008 (PDF 1.7)`

`  version      print version`

`  watermark    add, remove, update text, image or PDF watermarks for selected pages`

`...`

Adding an image watermark with the CLI:

`$ pdfcpu watermark add -mode image 'voucher_watermark.png' 's:1 abs, rot:0' in.pdf out.pdf`

Adding a watermark through the API:

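`package main`

`import (`

`  "log"`

`  "github.com/pdfcpu/pdfcpu/pkg/api"`

`  "github.com/pdfcpu/pdfcpu/pkg/pdfcpu"`

`)`

`// Note: these function signatures differ between pdfcpu releases; check the version you import.`

`func main() {`

`  // onTop=false draws the image beneath the page content (a watermark rather than a stamp).`

`  onTop := false`

`  wm, err := pdfcpu.ParseImageWatermarkDetails("watermark.png", "s:1 abs, rot:0", onTop)`

`  if err != nil {`

`    log.Fatal(err)`

`  }`

`  // A nil page selection means all pages; a nil configuration means the defaults.`

`  if err := api.AddWatermarksFile("in.pdf", "out.pdf", nil, wm, nil); err != nil {`

`    log.Fatal(err)`

`  }`

`}`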

**III. Merging PDFs**

* cpdf
* unipdf
* pdfcpu

**1. Merging PDFs with cpdf**

[cpdf](https://community.coherentpdf.com/) is a free, open-source command-line PDF toolkit with a rich feature set, for example:

* Merge PDF files together, or split them apart
* Encrypt and decrypt
* Scale, crop and rotate pages
* Read and set document info and metadata
* Copy, add or remove bookmarks
* Stamp logos, text, dates, page numbers
* Add or remove attachments
* Losslessly compress PDF files

Merging PDFs:

`$ cpdf -merge input1.pdf input2.pdf -o output.pdf`

**2. Merging PDFs with unipdf**

`$ unipdf merge output.pdf input1.pdf input2.pdf`

To merge PDFs through the API, see the examples in the unipdf GitHub repository.

**3. Merging PDFs with pdfcpu**

`$ pdfcpu merge output.pdf input1.pdf input2.pdf`

Note: pdfcpu only supports PDF versions up to PDF 1.7.

**IV. Splitting PDFs**

* cpdf
* unipdf
* pdfcpu

**1. Splitting PDFs with cpdf**

`## Split into one PDF per page`

`$ cpdf -split in.pdf 1 even -chunk 1 -o ./out%%%.pdf`

**2. Splitting PDFs with unipdf**

`## Extract the first page`

`$ unipdf split input.pdf out.pdf 1-1`

To split PDFs through the API, see the [unipdf github examples](https://github.com/unidoc/unipdf-examples/blob/v3/pages/pdf_split.go).

**3. Splitting PDFs with pdfcpu**

`$ pdfcpu split in.pdf .`

**V. Converting PDFs to images**

* mupdf
* xpdf

**1. Converting PDFs to images with mupdf**

[MuPDF](https://www.mupdf.com/index.html) is a lightweight PDF, XPS, and E-book viewer.
MuPDF consists of a software library, command line tools, and viewers for various platforms.

The mupdf download includes several tools, for example:

> mupdf 
> pdfdraw
> pdfinfo 
> pdfclean 
> pdfextract 
> pdfshow 
> xpsdraw

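Of these, pdfdraw can convert pages to images. (In newer MuPDF releases these tools are subcommands of `mutool`, such as `mutool draw`.)

`$ pdfdraw -o out%d.png in.pdf`

Note: the mupdf release I used does not support macOS.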

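**2. Converting PDFs to images with xpdf**

[xpdf](https://www.xpdfreader.com/) is a free PDF toolkit that covers text extraction, image conversion, HTML conversion, and more.

The package includes a set of tools: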

> pdfdetach
> pdffonts 
> pdfimages
> pdfinfo 
> pdftohtml
> pdftopng 
> pdftoppm 
> pdftops 
> pdftotext

The names give a good idea of what each tool does.

`## Convert a PDF to PNG with pdftopng`

`$ pdftopng in.pdf out-prefix`

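**VI. Decrypting PDFs**

A common situation: reading a PDF fails with an error saying the file is encrypted.

What can you do when you do not have the password?

* Decrypting with qpdf

[qpdf](http://qpdf.sourceforge.net/) is a command-line PDF tool that can attempt decryption without a password. This succeeds in some cases but not all: it works when the file only restricts permissions (an owner password with an empty user password), and fails when a user password is required to open the file.

`$ qpdf --decrypt in.pdf out.pdf`

* Decrypting with pdfcpu

`$ pdfcpu decrypt encrypted.pdf output.pdf`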
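Since neither tool is guaranteed to succeed, one option is to try them in turn and stop at the first success. This is only a rough sketch of that idea: `decryptAttempts` and `decryptWithoutPassword` are hypothetical names, both binaries are assumed to be on `PATH`, and any non-zero exit status is treated as a failure (qpdf can also exit non-zero for warnings even when it wrote the output, which this sketch does not distinguish).

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// decryptAttempts lists the commands to try, in order. Each entry is
// a full argv: the two password-less commands shown above.
func decryptAttempts(in, out string) [][]string {
	return [][]string{
		{"qpdf", "--decrypt", in, out},
		{"pdfcpu", "decrypt", in, out},
	}
}

// decryptWithoutPassword runs each command until one exits with
// status 0. It returns the last error if every attempt fails.
func decryptWithoutPassword(in, out string) error {
	var lastErr error
	for _, argv := range decryptAttempts(in, out) {
		err := exec.Command(argv[0], argv[1:]...).Run()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("%s: %w", argv[0], err)
	}
	return lastErr
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: decrypt <in.pdf> <out.pdf>")
		return
	}
	if err := decryptWithoutPassword(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, "all attempts failed:", err)
	}
}
```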

When you do have the password, you can decrypt with it:

* Decrypting with unipdf

`$ unipdf decrypt -p pass -o output.pdf input.pdf`

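**VII. Identifying PDFs and extracting their content**

Common tasks include checking whether a file is actually a PDF, extracting the text in a PDF, and extracting its images.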

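**1. Extracting text from a PDF**

Here xpdf is used to extract the text from the PDF; string operations or regular expressions can then be applied to it for business-specific analysis.

Extracting text with xpdf/pdftotext:

`$ pdftotext input.pdf output.txt`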
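To illustrate the "extract, then apply a regular expression" step, here is a small sketch that runs `pdftotext` with `-` as the output file (so the text goes to stdout) and pulls one field out of it. The field and its pattern (`Invoice No: ...`) are purely hypothetical; substitute whatever your documents actually contain.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
)

// invoiceNoRe is a hypothetical pattern for a line such as
// "Invoice No: INV-0042"; adapt it to your own documents.
var invoiceNoRe = regexp.MustCompile(`Invoice No[.:]?\s*([A-Z0-9-]+)`)

// findInvoiceNo returns the first invoice number found in text.
func findInvoiceNo(text string) (string, bool) {
	m := invoiceNoRe.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// pdfText runs `pdftotext <path> -` and returns the extracted text.
// Assumption: pdftotext is installed and on PATH.
func pdfText(path string) (string, error) {
	out, err := exec.Command("pdftotext", path, "-").Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext failed: %w", err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: invoiceno <file.pdf>")
		return
	}
	text, err := pdfText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if no, ok := findInvoiceNo(text); ok {
		fmt.Println(no)
	} else {
		fmt.Println("no invoice number found")
	}
}
```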

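Extracting text with unipdf:

`$ unipdf extract text input.pdf`

To extract text through the API, see the [unipdf github examples](https://github.com/unidoc/unipdf-examples/blob/v3/text/pdf_extract_text.go).

Extracting data by coordinates

The approaches above first extract all the text and then process it according to business needs.

Another approach is to extract data by its coordinates on the page, which is more flexible and more general. This uses [pdflib/tet](https://www.pdflib.com/products/tet):

`## Given a set of coordinates, extract the data inside that box`

`$ tet --pageopt "includebox={{38 707.93 243.91 716.93}}" input.pdf`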
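When the boxes come from code rather than from a hand-written command, the option string has to be assembled. The sketch below formats a single box in the same shape as the command above. The `Box` type and `includeBoxOpt` are hypothetical, and the field names assume the four numbers are the lower-left and upper-right corners in PDF points, which is consistent with the example values but should be checked against the TET manual.

```go
package main

import (
	"fmt"
	"strconv"
)

// Box is a rectangle on the page. Assumption: the numbers are the
// lower-left (LLX, LLY) and upper-right (URX, URY) corners.
type Box struct {
	LLX, LLY, URX, URY float64
}

// num formats a coordinate with the shortest exact decimal form,
// e.g. 38 -> "38" and 707.93 -> "707.93".
func num(v float64) string {
	return strconv.FormatFloat(v, 'f', -1, 64)
}

// includeBoxOpt renders one box as a --pageopt value.
func includeBoxOpt(b Box) string {
	return fmt.Sprintf("includebox={{%s %s %s %s}}",
		num(b.LLX), num(b.LLY), num(b.URX), num(b.URY))
}

func main() {
	opt := includeBoxOpt(Box{LLX: 38, LLY: 707.93, URX: 243.91, URY: 716.93})
	// Prints: tet --pageopt "includebox={{38 707.93 243.91 716.93}}" input.pdf
	fmt.Printf("tet --pageopt %q input.pdf\n", opt)
}
```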

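The coordinates can be obtained by having tet analyze the PDF and produce a TETML file, which contains coordinate information:

`$ tet --tetml input.pdf`

You can of course obtain the coordinates of data in a PDF by other means, such as Node.js tooling.

Note: pdflib/tet is commercial software. According to the official documentation, however, basic functionality on PDFs of no more than 10 pages or smaller than 1 MB does not require a license.

pdflib/tet provides a command-line tool and SDKs for many languages, including C, C++, Java, .NET, Perl, PHP, Python, Ruby, and Swift, but not yet Go. For gophers that currently leaves two options: the CLI or cgo.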

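**VIII. Repairing damaged PDF files**

Some PDF files display correctly when opened on the desktop but fail when checked in code. For example, try parsing a (damaged) PDF in Go with a third-party library: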

`package main`

`import (`

`  "fmt"`

`  "rsc.io/pdf"`

`)`

`func main() {`

`  filePath := "path/to/your/broken.pdf"`

`  // pdf.Open has to locate the cross-reference table, so a damaged file fails here.`

`  _, err := pdf.Open(filePath)`

`  if err != nil {`

`    fmt.Println("open pdf failed,err:", err.Error())`

`    return`

`  }`

`}`

Running it produces output like this:

> open pdf failed,err: malformed PDF: cross-reference table not found: {5 0 obj}\<\</Contents 6 0 R /Group \<\</CS /DeviceRGB /S /Transparency /Type /Group\>\> /MediaBox [0 0 595.27600098 841.89001465] /Parent 3 0 R /Type /Page\>\>

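The file opens fine on the desktop, yet the program fails to read it!

If you open the PDF in a desktop viewer, save it as a new PDF, and run the check again, the problem turns out to be fixed, presumably because the viewer rebuilds the cross-reference table when it writes the new file.

Great, problem solved!

But wait: if I have 1000 PDF files, do I really have to open and re-save each one by hand? That is not acceptable, so a way to repair files in bulk is needed.

After a long search online, I found roughly three options:

* Use the Acrobat SDK and call its [Save As function](https://www.jb51.net/article/48171.htm), which has the same effect as opening and re-saving on the desktop
* Repair the PDF with ghostscript
* Repair the PDF with [mupdf](https://www.mupdf.com/index.html)

I only verified that the third option works, using mupdf-0.9-linux-amd64.

The package contains an executable named pdfclean: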

`$ pdfclean broken.pdf repaired.pdf`

`+ pdf/pdf_xref.c:160: pdf_read_trailer(): cannot recognize xref format: '%'`

`| pdf/pdf_xref.c:481: pdf_load_xref(): cannot read trailer`

`\ pdf/pdf_xref.c:537: pdf_open_xref_with_stream(): trying to repair`

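The output shows that mupdf attempted a repair.

Opening the new PDF with the Go code above now succeeds.

All that remains is a bash script to repair the files in bulk, and the job is done!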

**IX. Inspecting the fonts in a PDF**

When the text in several PDFs needs to use consistent fonts, you have to find out which fonts each PDF uses. xpdf/pdffonts can do that analysis:

`$ pdffonts input.pdf`

`name                                 type              encoding         emb sub uni object ID`

`------------------------------------ ----------------- ---------------- --- --- --- ---------`

`NimbusSanL-Regu                      CID TrueType      Identity-H       yes no  yes     10  0`

`NimbusSanL-Bold                      CID TrueType      Identity-H       yes no  yes     20  0`

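**Other libraries:**

[PDF-Writer](https://github.com/galkahana/PDF-Writer/wiki)
An open-source C++ library that supports creating PDFs, merging PDFs, and image, watermark, and text operations.

Gophers need to wrap it in a layer of cgo code to use it.

[rsc/pdf](https://github.com/rsc/pdf)
A PDF library implemented in Go for reading PDF information such as content, page count, and fonts. See the [documentation](https://godoc.org/rsc.io/pdf) for details.

That is a lot of third-party tools, each with its own strengths. Many features overlap across most of them; which problems you run into depends on your actual use case.

I hope this summary is helpful.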